Day 17 - Regular expressions - Groups
– _They come in groups of threes._
Escape from New York (1981)
Oh, groups. How much do I love groups? Let’s see, definitely less than pizza, but after all that’s an
easy win. Well, culinary comparisons aside, I consider groups one of the best features of regular
expressions, both for their expressive power and for the fact that they introduce a certain tool that
is usually provided by programming languages only: variables.
Let’s start by considering why groups are important. Generally speaking, groups allow you to isolate
specific parts of the regular expression and reuse them later, but the way you can reuse them depends
on the tool that you are using. Groups are particularly useful with sed, as they greatly improve the
search and replace syntax.
The following code runs sed with extended regular expressions (enabled by the option -r), isolates a
group with parentheses and uses the matched text in the replacement string.
$ echo "10:30" | sed -r s,"([0-9]{2}).*","H:\1",
H:10
The regular expression tries to match 2 adjacent digits ([0-9]{2}), and putting them in a group with
parentheses makes sed store the matched value in a variable. In this case the value stored is 10, and
the rest of the line is consumed by the .* that matches anything. As you can see, sed accesses the
value of groups stored during the search with a backslash followed by a number. The first (and only,
in this case) group is named \1, the second \2, and so on. So, the replacement pattern is a string
beginning with H: and followed by the value matched and stored during the search, that is 10.
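To see what the trailing .* is doing, it may help to run the same substitution with and without it. This is a small sketch, assuming GNU sed (where -r enables the extended syntax); the variable names with_rest and without_rest are just illustrative.

```shell
# With the trailing .* the whole line is matched, so the whole line is replaced.
with_rest=$(echo "10:30" | sed -r 's,([0-9]{2}).*,H:\1,')

# Without it only the first two digits are matched; ":30" is left untouched.
without_rest=$(echo "10:30" | sed -r 's,([0-9]{2}),H:\1,')

echo "$with_rest"      # H:10
echo "$without_rest"   # H:10:30
```

In both cases \1 holds the same value, 10. What changes is how much of the input line is matched and therefore replaced.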
Let’s try something a bit more complex, with two groups.
$ echo "10:30" | sed -r s,"([0-9]{2}):([0-9]{2})","H:\1 - M:\2",
H:10 - M:30
This time the regular expression looks for two digits in the first group, then for a colon, and then
for another group of two digits. The two groups are named \1 and \2, so they can be used in the
replacement string. Please note that the colon is not included in any group, as we don’t need to
store it: it is just a literal separator between the two parts of the time.
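Since the groups are numbered, nothing forces you to use them in the order in which they were matched. The following sketch (again assuming GNU sed; the variable name reordered is only for illustration) puts the minutes before the hours and drops the colon, which was matched but never stored.

```shell
# \2 (minutes) is used before \1 (hours); the colon is part of the match
# but not of any group, so it disappears unless the replacement re-adds it.
reordered=$(echo "10:30" | sed -r 's,([0-9]{2}):([0-9]{2}),\2 minutes past \1,')

echo "$reordered"   # 30 minutes past 10
```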
Groups can include other groups, but it is important to note that the outer group still matches the
whole expression. An example can clarify the matter. If you run